Journal of Clinical Epidemiology — Latest Matching Preprints

1

State of play in individual participant data meta-analyses of randomised trials: Systematic review and consensus-based recommendations

Seidler, A. L.; Aagerup, J.; Nicholson, L.; Hunter, K.; Bajpai, R.; Hamilton, D.; Love, T.; Marlin, N.; Nguyen, D.; Riley, R.; Rydzewska, L.; Simmonds, M.; Stewart, L.; Tam, W.; Tierney, J.; Wang, R.; Amstutz, A.; Briel, M.; Burdett, S.; Ensor, J.; Hattle, M.; Libesman, S.; Liu, Y.; Schandelmaier, S.; Siegel, L.; Snell, K.; Sotiropoulos, J.; Vale, C.; White, I.; Williams, J.; Godolphin, P.

2026-02-04 epidemiology 10.64898/2026.02.03.26345481 medRxiv

Top 0.1%

38.9%


Background: Individual participant data (IPD) meta-analyses obtain, harmonise and synthesise the raw individual-level data from multiple studies, and are increasingly important in an era of data sharing and personalised medicine to inform clinical practice and policy. Objectives: (1) Describe the landscape of IPD meta-analysis of randomised trials over time; (2) establish current practice in design, conduct, analysis and reporting for pairwise IPD meta-analysis; and (3) derive recommendations to improve the conduct of and methods for future IPD meta-analyses. Design: Part 1: systematic review of all published IPD meta-analyses of randomised trials; Part 2: in-depth review of current methodological practice for pairwise IPD meta-analysis; and Part 3: adapted nominal group technique to derive consensus recommendations for IPD meta-analysis authors, educators and methodologists. Data sources: MEDLINE, Embase, and the Cochrane Database of Systematic Reviews (via the Ovid interface). Eligibility criteria: Part 1: all IPD meta-analyses of randomised trials published before February 2024, evaluating intervention effects and based on a systematic search. Part 2: all pairwise IPD meta-analyses from Part 1 published between February 2022 and February 2024. Part 3: selected panel of experienced IPD meta-analysis authors and/or methodologists. Results: Part 1: We identified 605 eligible IPD meta-analyses published between 1991 and 2024. The number of IPD meta-analyses published per year increased over time until 2019 but has since plateaued at about 60 per year. The most common clinical areas studied were cardiovascular disease (n=113, 19%) and cancer (n=110, 18%). The proportion of IPD meta-analyses published with Cochrane decreased over time from 16% (n=31/196) before 2015 to 3% (n=5/196) between 2021 and 2024. Part 2: 100 recent pairwise IPD meta-analyses were included in the in-depth review.
Most cited PRISMA-IPD (n=68, 68%) and conducted risk of bias assessments (n=82, 82%), with just under half carrying out subgroup analyses not at risk of aggregation bias (n=36/85, 41%). However, only 33% (n=33) and 29% (n=29) respectively provided a protocol or statistical analysis plan, and only 7% (n=6/82) reported using IPD to inform risk of bias assessments. Part 3: 24 experts participated in a consensus workshop. Key recommendations for improved IPD meta-analyses focused on transparency (prospective registration; published protocols and statistical analysis plans) and maximising value (searching trial registries; obtaining IPD for unpublished evidence; using IPD to address missing data and risk of bias). Methodologists and educators should strengthen dissemination of methods and support capacity building across clinical fields and geographical areas. Conclusions: The application and methodological quality of IPD meta-analyses of randomised trials have increased in the last decade, but shortcomings remain. Implementing our consensus-based recommendations will ensure future IPD meta-analyses generate better evidence for clinical decision making. Study registration: Open Science Framework (1).

Summary boxes

What is already known on this topic: IPD meta-analyses of randomised trials are regularly used to inform clinical policy and practice. They can provide better quality data and enable more thorough and robust analyses than standard aggregate data meta-analyses, but are resource-intensive and can be challenging to conduct, leading to variable methodological quality. Previous studies that evaluated the conduct of IPD meta-analyses pre-date several major developments, such as the introduction of the PRISMA-IPD reporting guideline.

What this study adds: This is the most comprehensive assessment of IPD meta-analyses of randomised trials to date (605 studies), showing an increase in publications over time followed by a recent plateau. The conduct of IPD meta-analysis has improved in recent years, including increased use of prospective registration, assessment of risk of bias, appropriate analyses of patient subgroup effects and citing of the PRISMA-IPD statement. Many shortcomings remain, including (i) insufficient pre-specification of methods such as outcomes and analyses, (ii) sub-standard transparency (including publication of protocols, statistical analysis plans and reporting of analyses), and (iii) failure to gain maximum value from IPD (i.e. include unpublished trials, use the IPD to inform risk of bias and trustworthiness assessments, and address missing data appropriately); expert consensus recommendations are provided for how to address these gaps.

2

An Empirical Assessment of Inferential Reproducibility of Linear Regression in Health and Biomedical Research Papers

Jones, L.; Barnett, A.; Hartel, G.; Vagenas, D.

2026-04-07 health systems and quality improvement 10.64898/2026.04.07.26350296 medRxiv

Top 0.1%

33.4%


Background: In health research, variability in modelling decisions can lead to different conclusions even when the same data are analysed, a challenge known as inferential reproducibility. In linear regression analyses, incorrect handling of key assumptions, such as normality of the residuals and linearity, can undermine reproducibility. This study examines how violations of these assumptions influence inferential conclusions when the same data are reanalysed. Methods: We randomly sampled 95 health-related PLOS ONE papers from 2019 that reported linear regression in their methods. Data were available for 43 papers, and 20 were assessed for computational reproducibility, with three models per paper evaluated. The 14 papers that included a model at least partially computationally reproduced were then examined for inferential reproducibility. To assess the impact of assumption violations, differences in coefficients, 95% confidence intervals, and model fit were compared. Results: Of the fourteen papers assessed, only three were inferentially reproducible. The most frequently violated assumptions were normality and independence, each occurring in eight papers. Violations of independence were particularly consequential and were commonly associated with inferential failure. Although reproduced analyses often retained the same binary statistical significance classification as the original studies, confidence intervals were frequently wider, indicating greater uncertainty and reduced precision. Such uncertainty may affect the interpretation of results and, in turn, influence treatment decisions and clinical practice. Conclusion: Our findings demonstrate that substantial violations of key modelling assumptions often went undetected by authors and peer reviewers and, in many cases, were associated with inferential reproducibility failure. This highlights the need for stronger statistical education and greater transparency in modelling decisions. 
Rather than applying rigid or misinformed rules, such as incorrectly testing the normality of the outcome variable, researchers should adopt modelling frameworks guided by the research question and the study design. When assumptions are violated, appropriate alternatives, such as robust methods, bootstrapping, generalized linear models, or mixed-effects models, should be considered. Given that assumption violations were common even in relatively simple regression models, early and sustained collaboration with statisticians is critical for supporting robust, defensible, and clinically meaningful conclusions.
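
The point about testing the residuals rather than the outcome can be illustrated with a small simulation. The sketch below is not from the paper: the data are simulated, and excess kurtosis is used only as a convenient stand-in for a formal normality check. It assumes a binary predictor with a large effect, which makes the raw outcome clearly bimodal while the residuals remain approximately normal.

```python
import random
import statistics


def excess_kurtosis(values):
    """Sample excess kurtosis; close to 0 for normally distributed data."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


def fit_simple_ols(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


rng = random.Random(2019)  # fixed seed so the sketch is repeatable

# Simulated data (an assumption for illustration): binary exposure, normal errors.
x = [i % 2 for i in range(2000)]
y = [6.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

intercept, slope = fit_simple_ols(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

outcome_kurtosis = excess_kurtosis(y)            # far from 0: outcome is bimodal
residual_kurtosis = excess_kurtosis(residuals)   # near 0: residuals look normal
print(f"outcome: {outcome_kurtosis:.2f}, residuals: {residual_kurtosis:.2f}")
```

A normality check applied to `y` here would flag a violation even though the model's error assumption holds, which is the kind of misinformed rule the abstract warns against.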

3

Challenges in the Computational Reproducibility of Linear Regression Analyses: An Empirical Study

Jones, L. V.; Barnett, A.; Hartel, G.; Vagenas, D.

2026-04-07 health systems and quality improvement 10.64898/2026.04.07.26350286 medRxiv

Top 0.1%

28.4%


Background: Reproducibility concerns in health research have grown, as many published results fail to be independently reproduced. Achieving computational reproducibility, where others can replicate the same results using the same methods, requires transparent reporting of statistical tests, models, and software use. While data-sharing initiatives have improved accessibility, the actual usability of shared data for reproducing research findings remains underexplored. Addressing this gap is crucial for advancing open science and ensuring that shared data meaningfully support reproducibility and enable collaboration, thereby strengthening evidence-based policy and practice. Methods: A random sample of 95 PLOS ONE health research papers from 2019 reporting linear regression was assessed for data-sharing practices and computational reproducibility. Data were accessible for 43 papers. From the randomly selected sample, the first 20 papers with available data were assessed for computational reproducibility. Three regression models per paper were reanalysed. Results: Of the 95 papers, 68 reported having data available, but 25 of these lacked the data required to reproduce the linear regression models. Only eight of 20 papers we analysed were computationally reproducible. A major barrier to reproducing the analyses was the great difficulty in matching the variables described in the paper to those in the data. Papers sometimes failed to be reproduced because the methods were not adequately described, including variable adjustments and data exclusions. Conclusion: More than half (60%) of analysed studies were not computationally reproducible, raising concerns about the credibility of the reported results and highlighting the need for greater transparency and rigour in research reporting. When data are made available, authors should provide a corresponding data dictionary with variable labels that match those used in the paper. 
Analysis code, model specifications, and any supporting materials detailing the steps required to reproduce the results should be deposited in a publicly accessible repository or included as supplementary files. To increase the reproducibility of statistical results, we propose a Model Location and Specification Table (MLast), which tracks where and what analyses were performed. In conjunction with a data dictionary, MLast enables the mapping of analyses, greatly aiding computational reproducibility.

4

Time-to-retraction and likelihood of evidence contamination (VITALITY Extension I): a retrospective cohort analysis

Yuan, Y.; Peng, Z.; Doi, S. A. R.; Furuya-Kanamori, L.; Cao, H.; Lin, L.; Chu, H.; Loke, Y.; Mol, B. W.; Golder, S.; Vohra, S.; Xu, C.

2026-02-24 epidemiology 10.64898/2026.02.20.26346631 medRxiv

Top 0.1%

23.7%


Background: The number of problematic randomized clinical trials (RCTs) has risen sharply in recent decades, posing serious challenges to the integrity of the healthcare evidence ecosystem. Objective: To investigate whether retraction of problematic RCTs could reduce evidence contamination. Design: Retrospective cohort study. Setting: A secondary analysis of the VITALITY Study database. Participants: 1,330 retracted RCTs with 847 systematic reviews. Measurements: The difference in the median number (and its interquartile range, IQR) of contaminated systematic reviews before and after retraction, and the association between time-to-retraction and likelihood of evidence contamination. Results: Among these retracted RCTs, 426 led to evidence contamination, resulting in 1,106 contamination events (251 after retraction vs. 855 before retraction). The time interval between RCT publication and first contamination ranged from 0.2 to 30.9 years, with a median of 3.3 years (95% CI: 3.0 to 3.9). The median number of contaminated systematic reviews was lower after retraction than before retraction (0, IQR: 0 to 1 vs. 1, IQR: 1 to 2, P < 0.01). Compared with trials retracted more than 7.5 years after publication, those retracted between 1.0 and 1.8 years (OR = 0.70, 95% CI: 0.60 to 0.80) and those retracted within 1.0 year (OR = 0.69, 95% CI: 0.60 to 0.80) were associated with a lower likelihood of evidence contamination. Limitations: The analysis assessed only contaminated systematic reviews with quantitative synthesis and was limited to retracted RCTs. Conclusions: Retracting problematic RCTs can significantly reduce evidence contamination, and faster retraction was associated with less contamination. To safeguard the integrity of the evidence ecosystem, academic journals should act promptly in the retraction of problematic studies to minimize their downstream impact. Primary Funding Sources: The National Natural Science Foundation of China (72204003, 72574229).

5

Does the sensitivity- and precision-maximizing RCT filter find all 'included' records retrieved by the sensitivity-maximizing filter on Ovid MEDLINE? An investigation using 14 Cochrane reviews

Fulbright, H. A.; Marshall, D.; Evans, C.; Corbett, M.

2026-03-23 health informatics 10.64898/2026.03.20.26348876 medRxiv

Top 0.1%

22.5%


Objectives: To inform users about the impact of two updated study filters for limiting database search results to randomized controlled trials on Ovid MEDLINE: a sensitivity-maximizing version (SM) and a sensitivity-and-precision-maximizing version (SaPM). To provide an updated understanding of how they compare to each other. Methods: Using the final included records of 14 Cochrane reviews that had used the SM filter, we determined how many available records on Ovid MEDLINE would have been retrieved with each filter, investigated why records were missed, and calculated the unique yield, precision, and number-needed-to-read (NNR) for each filter. We also performed forwards and backwards citation searching on missed records (to determine if this could mitigate the risk of missing includes) and calculated the percentage change in the overall number-needed-to-screen (ONNS) when applying each filter to reproduction strategies. Results: On average, the SaPM filter reduced ONNS by 83% and retrieved 95.9% of includes compared with 98.2% retrieved by the SM filter. The SaPM filter offered a further 28.2% mean reduction in ONNS over the SM filter. The SM filter had a unique yield of 12 and a precision of 1.5%, versus a unique yield of three and precision of 4.4% for the SaPM filter. NNR was 68 for the SaPM filter versus 189 for the SM filter. Conclusion: The SaPM filter reduced the screening burden with minimal risk of missing eligible records (which could be mitigated by citation searching). Decisions about which filter to use should consider both the needs and resources of the review.
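
The filter metrics named above can be sketched for a single review as follows. This is an illustrative sketch rather than the authors' code: the class name `FilterRun` and all counts are hypothetical, and NNR is taken to be the reciprocal of precision, which is the usual definition but is an assumption here. The example also shows that a mean of per-review NNR values need not equal the reciprocal of mean precision, so summary figures of the two kinds are not interchangeable.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class FilterRun:
    """Counts for one review's search strategy combined with one RCT filter."""
    retrieved: int           # records returned by strategy plus filter
    includes_retrieved: int  # retrieved records that are final includes
    includes_available: int  # final includes indexed in the database

    @property
    def sensitivity(self) -> float:
        return self.includes_retrieved / self.includes_available

    @property
    def precision(self) -> float:
        return self.includes_retrieved / self.retrieved

    @property
    def nnr(self) -> float:
        # Number needed to read: records screened per included record found.
        return 1.0 / self.precision


# Hypothetical counts for two reviews (not taken from the study).
runs = [FilterRun(1000, 50, 50), FilterRun(2000, 10, 10)]

mean_precision = fmean(r.precision for r in runs)
mean_nnr = fmean(r.nnr for r in runs)
print(f"mean precision {mean_precision:.4f}, mean NNR {mean_nnr:.1f}")
print(f"1 / mean precision = {1 / mean_precision:.1f}")
```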

6

Collaborative large language models (LLMs) are all you need for screening in systematic reviews

Parmar, M.; Naqvi, S. A. A.; Warraich, K.; Saeidi, A.; Rawal, S.; Faisal, K. S.; Kazmi, S. Z.; Fatima, M.; He, H.; Safdar, M.; Liu, W.; Haddad, T.; Wang, Z.; Murad, M. H.; Baral, C.; Riaz, I. B.

2026-02-17 health informatics 10.64898/2026.02.07.26345640 medRxiv

Top 0.1%

19.5%


Background: The ability of large language models (LLMs) to work collaboratively and screen studies in a systematic review (SR) is under-explored. Hence, we aimed to evaluate the effectiveness of LLMs in automating the process of screening in systematic reviews. Methods: This is an observational study which included labeled data (titles and abstracts) for five SRs. Originally, two reviewers screened the citations independently for eligibility. A third reviewer cross-checked each citation for quality assurance. GPT-4, Claude-3-Sonnet, and Gemini-Pro-1.0 were run with zero-shot chain-of-thought prompting. Collaborative approaches included (i) conflict resolution using benefit of the doubt, (ii) majority voting using an independent third LLM and (iii) conflict resolution using an informed third LLM. Performance was assessed using accuracy, precision for exclusion, and recall for inclusion. Work saved over samples (WSS) was computed to estimate the reduction in manual human effort. Results: A total of 11,300 articles were included in this study. The individual models, GPT-4, Claude-3-Sonnet, and Gemini-Pro-1.0, exhibited high precision for exclusion, achieving 99.7%, 99.7%, and 99.2%, and high recall for inclusion, achieving 95.5%, 96.6% and 85.7%, respectively. The collaborative approach utilizing the two best-performing models (GPT-4 and Claude-3S) achieved an average precision of 99.9% and a recall of 98.5% (across all collaborative approaches). Furthermore, the proposed collaborative approach resulted in an average WSS of 63.5%, compared to the average WSS of 45.2% for individual models. Conversational LLM interactions showed a consistent pattern of results. Limitations: This study was limited by its reliance on proprietary models and its evaluation on oncology datasets. Conclusion: Evidence shows that collaborative LLMs enable efficient, high-performing screening in systematic reviews, supporting continuous evidence updates.
Primary funding source: NIH (U24CA265879-01-1) and Carolyn-Ann-Kennedy-Bacon Fund.
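
Two of the collaborative rules and the evaluation metrics can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: "benefit of the doubt" is assumed to mean that a citation is included whenever the two models disagree, WSS is assumed to follow the common definition (TN + FN) / N - (1 - recall), and the labels are toy data. The function names are hypothetical, and the informed-third-LLM rule is not modelled.

```python
def benefit_of_doubt(a: bool, b: bool) -> bool:
    """Assumed rule: include if either model votes to include."""
    return a or b


def majority_vote(a: bool, b: bool, c: bool) -> bool:
    """Include when at least two of three independent models vote to include."""
    return (a + b + c) >= 2


def screening_metrics(predicted, truth):
    """Recall for inclusion, precision for exclusion, and WSS (assumed formula).

    Assumes at least one true include and at least one predicted exclusion.
    """
    pairs = list(zip(predicted, truth))
    tp = sum(p and t for p, t in pairs)
    fn = sum((not p) and t for p, t in pairs)
    tn = sum((not p) and (not t) for p, t in pairs)
    n = len(pairs)
    recall = tp / (tp + fn)
    precision_exclusion = tn / (tn + fn)
    wss = (tn + fn) / n - (1.0 - recall)
    return recall, precision_exclusion, wss


# Toy labels (True = include); not data from the study.
truth   = [True, True, False, False, False, False, False, False, False, False]
model_a = [True, False, False, False, False, False, False, False, True, False]
model_b = [False, True, False, False, False, False, False, False, False, False]

combined = [benefit_of_doubt(a, b) for a, b in zip(model_a, model_b)]
single = screening_metrics(model_a, truth)
collab = screening_metrics(combined, truth)
print("model A alone:", single)
print("benefit of the doubt:", collab)
```

In this toy case the combined rule recovers the include that each model missed on its own, at the cost of carrying forward one extra false positive for human review.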

7

MedDRA Adoption and Adverse Event Reporting Quality in Gastrointestinal and Abdominal Surgery Randomized Controlled Trials: A Cross-Sectional Analysis

Camasso, N.; Kirby, K.; Calvert, N.; Stroup, J.; Langerman, R.; Vassar, M.

2026-02-05 health systems and quality improvement 10.64898/2026.02.04.26345608 medRxiv

Top 0.1%

19.2%


Introduction: Adverse event (AE) reporting transparency is essential for evidence-based surgical practice, yet substantial reporting gaps persist despite Consolidated Standards for Reporting Trials (CONSORT) Harms guidance. The Medical Dictionary for Regulatory Activities (MedDRA) provides standardized terminology for AE classification, but its association with AE reporting quality remains unexplored. Objectives: The purpose of this study was to establish the frequency of MedDRA utilization in gastrointestinal and abdominal surgical trials, identify predictors of its adoption, and quantify the association between MedDRA use and adverse event reporting quality as measured by Completeness scores, registry-publication Concordance, and overall Transparency indices. Design: Cross-sectional analysis of matched randomized controlled trial registry-publication pairs. Participants: 116 gastrointestinal and abdominal surgery randomized controlled trials registered on ClinicalTrials.gov with results posted between September 2009 and December 2024 and an associated peer-reviewed publication. Primary and Secondary Outcome Measures: Primary outcomes were differences in AE reporting quality between MedDRA-documenting and non-documenting trials, measured using Harms Reporting Completeness score (0-8), Concordance score (0-7), and Harms Transparency Index (0-15). Secondary outcomes included prevalence of MedDRA adoption and predictors of MedDRA documentation via univariable logistic regression. Results: Among 116 included trials, only 22 (18.8%) explicitly documented MedDRA use. Industry-funded trials (OR=29.32, 95% CI=8.94-118.50, p<0.001) and those with at least one U.S. site (OR=4.59, 95% CI=1.22-30.02, p=0.050) demonstrated significantly higher rates of MedDRA adoption.
Trials documenting MedDRA use demonstrated significantly improved reporting across all three score parameters: Completeness score (p<0.001), Concordance score (p=0.002), and Transparency Index (p<0.001). MedDRA use was also associated with lower rates of registry-publication discordance across key safety metrics: serious adverse event (SAE) participant count registry-publication discordance was 59.1% in MedDRA-documenting trials and 85.1% in non-MedDRA trials; mortality reporting discordance was 60.0% in MedDRA trials and 82.1% in non-MedDRA trials. Conclusion: Despite strong association with improved AE reporting completeness and registry-publication concordance, MedDRA adoption in gastrointestinal and abdominal surgical trials remains below 20%, concentrated among industry-funded studies. The predominance of unstandardized terminology and free-text strategies promotes reporting inadequacies that complicate evidence synthesis and undermine evidence-based surgical practice. Journals, funding agencies, academic institutions, and researchers should prioritize the adoption of standardized AE terminology to enhance transparency and improve surgical research. Trial Registration: PROSPERO CRD420251081191.

Strengths and Limitations of this Study: This is the first study to quantify the association between MedDRA use and adverse event reporting quality in surgical trials. Dual independent screening and extraction with a pre-registered protocol minimizes bias and enhances reproducibility. Analysis was limited to gastrointestinal and abdominal surgery; generalizability to other surgical subspecialties remains uncertain. Explicit MedDRA documentation was required; trials using MedDRA without disclosure would be misclassified as non-users. Concordance assessment examined numerical agreement without evaluating the clinical significance of discrepancies.

8

Changes in Evidence Used for FDA Novel Drug Approvals Following the Implementation of the 21st Century Cures Act

Kaplan, R. M.; Narayan, A.; Irvin, V. L.; Koong, A. J.; Song, S.

2026-02-03 scientific communication and education 10.64898/2026.01.31.702992 medRxiv

Top 0.1%

13.0%


Background: The 21st Century Cures Act (2017) expanded FDA flexibility in applying methodological standards for drug approval. To examine trends before and after implementation, we independently reviewed all novel drugs approved between 2016 and 2024. Methods: We constructed a database of all novel FDA approvals from January 1, 2016, through December 31, 2024. Each study linked to an approved drug (N=6,763) was cataloged by study number, sponsorship, and timing of results reporting relative to completion. Results: Since 2016, the number of studies supporting approval has steadily declined. Beginning in 2017, the modal number of studies per approval fell to one. Industry sponsorship increased while NIH-supported studies decreased. Average time to public posting of results exceeded the one-year statutory limit. Conclusions: After implementation of the Cures Act, FDA approvals have relied on fewer, increasingly industry-sponsored studies. Although this may accelerate access to new therapies, it raises concerns about the strength of evidence for safety and effectiveness.

9

Accuracy and efficiency of using artificial intelligence for data extraction in systematic reviews. A noninferiority study within reviews

Lee, D. C. W.; O'Brien, K. M.; Presseau, J.; Yoong, S.; Lecathelinais, C.; Wolfenden, L.; Thomas, J.; Arno, A.; Hutton, B.; Hodder, R. K.

2026-02-27 public and global health 10.64898/2026.02.25.26347053 medRxiv

Top 0.1%

12.7%


Background: Systematic reviews are important for informing public health policies and program selection; however, they are time- and resource-intensive. Artificial intelligence (AI) offers a solution to reduce these labour-intensive requirements for various aspects of systematic review production, including data extraction. To date, there is limited robust evidence evaluating the accuracy and efficiency of AI for data extraction. This study within a review (SWAR) aimed to determine whether human data extraction assisted by an AI research assistant (Elicit®) is noninferior to human-only data extraction in terms of accuracy (i.e. agreement) and time-to-completion. Secondary aims included comparing error types and costs. Methods: A two-arm noninferiority SWAR was conducted to compare AI-assisted and human-only data extraction from 50 RCTs of chronic disease interventions. Participants were randomised to extract all data required for conducting a review, using either the AI-assisted or human-only method. Accuracy was assessed using a three-point rubric by an independent assessor blinded to group allocation, based on agreement between extracted data and the assessor. Accuracy scores were standardized to a 0-100 scale. Analysis included overall and subgroup accuracy (data group and data type) using paired t-tests. Time-to-completion was self-reported by data extractors. Errors were coded by type and severity, and costs were calculated for data extraction, preparation of files, training and the Elicit® Pro subscription. Results: There was no difference in overall accuracy between the AI-assisted and human-only arms (mean difference (MD) 0.57 (on a 0-100 scale), 95% confidence interval (CI) -1.29, 2.43). Subgroup analysis by data group found AI-assisted to be more accurate than human-only data extraction for data variables describing intervention and control group (MD 4.75, 95% CI 2.13, 7.38), but otherwise no subgroup differences were observed.
AI-assisted data extraction was significantly faster (MD 24.82 mins, 95% CI 18.80, 30.84). Error types (missed or omitted data: AI-assisted 3.6%, human-only 3.4%) and severity (minor errors: AI-assisted 6.7%, human-only 6.5%) were similar between arms, and AI-assisted extraction cost $181.98 less than human-only data extraction across the 50 studies. Conclusion: AI-assisted data extraction using Elicit® showed noninferior accuracy, faster completion times, similar error types and severity, and lower costs compared to human-only extraction. These efficiency gains, without loss in accuracy, suggest AI-assisted data extraction can replace one human-only data extractor in future systematic reviews of RCTs. Future research should explore different models of AI data extraction, such as two AI-assisted extractors or an AI-only extractor paired with a human-only extractor, and comparison of AI-assisted to AI-only extraction.

10

TrialScout links published results to trial registrations using a large language model

Ahnström, L.; Bruckner, T.; Aspromonti, D. A.; Caquelin, L.; Cummins, J.; DeVito, N. J.; Axfors, C.; Ioannidis, J. P. A.; Nilsonne, G.

2026-03-17 epidemiology 10.64898/2026.03.15.26348383 medRxiv

Top 0.1%

12.5%


Background: Multiple stakeholders need to locate results of registered clinical trials but frequently struggle to find them. Summary results of clinical trials are often not published in trial registries, and publications containing trial results are often not explicitly linked to their respective trial registrations. Finding these results is important to researchers, systematic reviewers, research funders, regulators, clinical practitioners, and patients. Methods: We developed TrialScout, a computer program that uses a large language model to match clinical trials registered on ClinicalTrials.gov with corresponding result publications indexed in PubMed. TrialScout's performance was evaluated through comparison to human-coded matches from previous studies of results reporting rates. Subsequently, TrialScout was applied to a random sample of 9,600 completed or terminated trials. Results: TrialScout had a sensitivity of 92.5% and a specificity of 81.2% compared to human coders. Manual review of 200 cases where TrialScout disagreed with human researchers showed that a majority (123/200, 61.5%, 95% CI, 54.4-68.3%) of disagreements were due to human errors. When used on 9,600 sampled trials in ClinicalTrials.gov, TrialScout found result publications for 6,110 (63.6%) of trials. Discussion: TrialScout reliably located results of completed clinical trials. The tool offers benefits in terms of speed and efficiency. Estimating TrialScout's accuracy is limited by the lack of a true gold standard. TrialScout can accelerate the process of locating trial results in the scientific literature and can assist in monitoring trial reporting practices.
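
A confidence interval for a proportion such as 123/200 can be approximated as follows. The abstract does not say which interval method was used, so the Wilson score interval here is an assumption chosen because it needs only the standard library; it gives limits close to, but not identical with, the reported 54.4-68.3%. The function name `wilson_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, level: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    p = successes / n
    denom = 1.0 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# Disagreements attributed to human error in the manual review: 123 of 200.
lower, upper = wilson_interval(123, 200)
print(f"{123 / 200:.1%} (95% CI {lower:.1%} to {upper:.1%})")
```

An exact (Clopper-Pearson) interval is typically slightly wider than the Wilson interval, which may account for the small difference from the published limits.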

11

Transportability of missing data models across study sites for research synthesis

Thiesmeier, R.; Madley-Dowd, P.; Ahlqvist, V.; Orsini, N.

2026-03-10 epidemiology 10.64898/2026.03.09.26347913 medRxiv

Top 0.1%

10.2%


Introduction: Systematically missing covariates are a common challenge in medical research synthesis of quantitative data, particularly when individual participant data cannot be shared across study sites. Imputing covariate values in studies where they are systematically unobserved using information from sites where the covariate is observed implicitly assumes similarity of associations across studies. The behaviour of this assumption, and the bias arising from violating it, remains difficult to qualitatively reason about. Here, we evaluated a two-stage imputation approach for handling systematically missing covariates using simulations across a range of statistical and causal heterogeneity scenarios. Methods: We conducted a simulation study with varying degrees of between-study heterogeneity and systematic differences in model parameters. A binary confounder was set to systematically missing in half of the studies. Study-specific effect estimates were combined using a two-stage meta-analytic model. The performance of the imputation approach was evaluated with the primary estimand being the pooled conditional confounding-adjusted exposure effect across all studies. Results: Bias in the pooled adjusted effect estimate was small across scenarios with low to substantial between-study heterogeneity. Bias increased monotonically with increasingly pronounced differences in causal structures across study sites. Coverage remained close to the nominal level under low to substantial between-study heterogeneity, but deteriorated markedly as differences in causal structures between study sites became more severe. Conclusion: The two-stage cross-site imputation approach produced valid pooled effect estimates across a wide range of simulated scenarios but showed monotonic sensitivity to differences in causal structures across studies.
The results provide insight into the conditions under which cross-site imputation may be appropriate for handling systematically missing covariates in research synthesis.
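
The second stage of a two-stage analysis, pooling study-specific estimates, can be sketched as below. The abstract does not specify the pooling model or the imputation procedure, so this is only an illustration under stated assumptions: it uses the DerSimonian-Laird random-effects estimator, the function name `pool_random_effects` is hypothetical, and the inputs are made-up adjusted effect estimates with their standard errors. The cross-site imputation step is not shown.

```python
from math import sqrt


def pool_random_effects(estimates, std_errors):
    """DerSimonian-Laird random-effects pooling of study-level estimates."""
    k = len(estimates)
    w = [1.0 / se ** 2 for se in std_errors]            # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    # Cochran's Q measures dispersion of estimates around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                  # between-study variance
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    pooled_se = sqrt(1.0 / sum(w_star))
    return pooled, pooled_se, tau2


# Made-up study-specific adjusted effects (e.g. log odds ratios) and standard errors.
pooled, pooled_se, tau2 = pool_random_effects([0.2, 0.4, 0.9], [0.1, 0.1, 0.1])
print(f"pooled = {pooled:.3f}, SE = {pooled_se:.3f}, tau^2 = {tau2:.3f}")

# Identical estimates imply no detectable between-study heterogeneity.
same, same_se, same_tau2 = pool_random_effects([0.5, 0.5, 0.5], [0.1, 0.2, 0.1])
```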

12

Assessing the Secondary Use and Scientific Impact of Shared Clinical Trial Data: A Cross-Sectional Study of Clinical Trials Shared on the YODA Project Platform

Taherifard, E.; Mooghali, M.; Hakimian, H. R.; Mane, S. R.; Fu, M.; Bamford, S.; Berlin, J. A.; Childers, K.; Desai, N. R.; Gross, C. P.; Hewens, D.; Lehman, R.; Ritchie, J. D.; Sargood, T.; Waldstreicher, J.; Wallach, J. D.; Willeford, M. K.; Krumholz, H. M.; Ross, J. S.

2026-03-26 public and global health 10.64898/2026.03.26.26349328 medRxiv

Top 0.1%

10.0%


Objective: To assess the number, timing of publication, characteristics, and scientific impact of secondary publications generated using individual participant-level data (IPD) from a portfolio of Johnson & Johnson-sponsored clinical trials shared with external investigators through a data sharing platform. Design: Cross-sectional study. Setting: Yale University Open Data Access (YODA) Project platform. Participants: Johnson & Johnson-sponsored clinical trials listed on the YODA Project platform with IPD available for external sharing as of December 31, 2021, and with a full-length, peer-reviewed publication (i.e., primary publication) reporting primary endpoint results by the original trial investigators. Main outcome measures: Number, timing of publication, research objectives, analysis type, and scientific impact of secondary publications using IPD from these trials identified through citation searches of primary publications in Web of Science through June 2025. Scientific impact metrics included journal impact factor, annual citation count, annual Altmetric Attention Score, and annual Mendeley reader count. Secondary publications were classified as internal (authored by at least one original trial investigator) or external. Results: Among 336 eligible trials, 265 (78.9%) had at least one associated secondary publication, totaling 1,167 secondary publications, of which 209 (17.9%) were external. Among external secondary publications for which the data access mechanism was reported (n=190; 90.9%), most obtained access through data sharing platforms (n=161; 84.7%), primarily the YODA Project (n=157; 82.6%). All secondary publications published from 3 years before through the first 2 years after the primary publication (n=161) were internal (100%). Over time, however, external publications increased steadily, exceeding 50% of all secondary publications by year 11 and thereafter.
External secondary publications were more frequently pooled analyses (151/209 [72.2%] vs 534/958 [55.7%]; P<0.001). Predictive or prognostic modelling (108/209 [51.7%] vs 322/958 [33.6%]; P<0.001), development of statistical models or algorithms (60/209 [28.7%] vs 114/958 [11.9%]; P<0.001), and validation of existing methods, models, or risk scores (32/209 [15.3%] vs 66/958 [6.9%]; P<0.001) were more frequent among external than internal secondary publications. Compared to internal secondary publications, external secondary publications were published in journals with higher impact factors (median, 6.7 [IQR, 3.4-16.6] vs 4.6 [2.9-10.2]; P=0.002) and had higher annual Altmetric Attention Scores (median, 2.1 [0.7-7.1] vs 0.6 [0.3-2.3]; P<0.001), but lower annual citation counts (median, 2.7 [1.1-5.6] vs 3.4 [1.6-7.5]; P<0.001) and were less likely to be cited in clinical guidelines (21/184 [11.4%] vs 235/805 [29.2%], P<0.001) or policy documents (14/184 [7.6%] vs 206/805 [25.6%], P<0.001); there was no difference in annual Mendeley reader counts (median, 7.4 [3.9-13.0] vs 8.0 [5.1-13.6], P=0.13). Conclusions: Clinical trial data shared with external investigators through a data sharing platform generated substantial and sustained secondary research by both original trial investigators and external investigators. The proportion of secondary publications from any clinical trial generated by external investigators increased over time as external investigators pursued complementary research objectives that achieved a comparable scientific impact. Structured data sharing mechanisms may further enhance the scientific impact of clinical trials.
What is already known on this topic: Sharing individual participant-level data (IPD) from clinical trials can promote transparency, reproducibility, and secondary research. Several initiatives, including the Yale University Open Data Access (YODA) Project and government-supported data sharing platforms, provide external investigators with access to clinical trial data. While prior evaluations of secondary research generated from shared clinical trial data suggest that external investigators' publications have citation impacts comparable to those of original trial investigators, overall evidence remains limited. What this study adds: Analysis of 336 industry-sponsored clinical trials with IPD shared through the YODA Project showed that most generated secondary publications, by both original trial investigators and external investigators. The proportion of secondary publications from any clinical trial generated by external investigators increased over time, and compared with those generated by the original trial investigators, these publications more frequently use pooled analyses and focus on predictive or prognostic modelling and the development and validation of statistical methods. Secondary publications generated by external investigators were more often published in higher-impact journals and received higher Altmetric Attention Scores, but had lower annual citation counts and were less likely to be cited in clinical guidelines or policy documents than those generated by the original trial investigators.
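The reported comparison of pooled analyses (151/209 external vs 534/958 internal) can be re-derived from the counts alone. The sketch below is only an illustrative check: the abstract does not say which test produced the P values, so the Pearson chi-square test used here is an assumption, and the function name `chi_square_2x2` is hypothetical.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (no continuity correction) for the table
    [[a, b], [c, d]], where rows are groups and columns are yes/no counts."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Counts taken from the abstract: pooled analyses among external (151 of 209)
# and internal (534 of 958) secondary publications.
external_yes, external_total = 151, 209
internal_yes, internal_total = 534, 958

external_pct = 100 * external_yes / external_total
internal_pct = 100 * internal_yes / internal_total

stat = chi_square_2x2(
    external_yes, external_total - external_yes,
    internal_yes, internal_total - internal_yes,
)

# 10.83 is the chi-square critical value for 1 degree of freedom at P = 0.001.
CRITICAL_P_0_001 = 10.83

print(f"external: {external_pct:.1f}%, internal: {internal_pct:.1f}%")
print(f"chi-square = {stat:.2f}, exceeds 0.001 threshold: {stat > CRITICAL_P_0_001}")
```

Under that assumption, the recomputed percentages match the reported 72.2% and 55.7%, and the statistic is consistent with the reported P<0.001.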

13

Limiting to English language records: A comparison of five methods on Ovid MEDLINE and Embase versus removal during screening

Fulbright, H. A.; Morrison, K.

2026-03-20 health informatics 10.64898/2026.03.18.26348470 medRxiv

Top 0.1%

8.6%

Background: For evidence syntheses using English language limits, several different methods and approaches are available. Objective: To understand the English language (EL) limits available on Ovid MEDLINE and Embase and the application of language metadata on these databases. To compare the impact of five EL limits versus removing non-English language (NEL) records during screening. Methods: Using the records included at full text screening or excluded on NEL status during screening in seven evidence syntheses, we tested five EL limits on 1,509 MEDLINE and 1,584 Embase records. 'Includes' removed or 'NEL excludes' retrieved were investigated. Results: All EL limits performed identically: 99.8% of MEDLINE 'includes' were retrieved versus 99.7% on Embase. All five 'includes' incorrectly removed with EL limits had language metadata errors. Although 98.2% of MEDLINE and 94.6% of Embase 'NEL excludes' were removed with EL limits, eight MEDLINE and nine Embase records were available in English. Discussion: The risk of excluding potentially eligible records due to language restrictions (whether applied during the strategies or screening) could be mitigated with forward and backward citation searching. Conclusion: EL limits risk removing records with incorrect language metadata. However, EL records might also be excluded on language during screening.

14

From Protocol to Analysis Plan: Development and Validation of a Large Language Model Pipeline for Statistical Analysis Plan Generation using Artificial Intelligence (SAPAI)

Jafari, H.; Chu, P.; Lange, M.; Maher, F.; Glen, C.; Pearson, O. J.; Burges, C.; Martyn, M.; Cross, S.; Carter, B.; Emsley, R.; Forbes, G.

2026-03-19 health systems and quality improvement 10.64898/2026.03.19.26348626 medRxiv

Top 0.1%

8.3%

Background: Statistical Analysis Plans (SAPs) are essential for trial transparency and credibility but are resource-intensive to produce. While Large Language Models (LLMs) have shown promise in drafting protocols, their ability to generate high-quality, protocol-compliant SAPs remains untested against current content guidance. This study developed and validated an LLM-based pipeline for drafting SAPs from clinical trial protocols. Methods: We developed a structured, section-by-section prompting pipeline aligned with standard SAP guidance. We applied this pipeline to nine clinical trial protocols using three leading LLMs: OpenAI GPT-5, Anthropic Claude Sonnet 4, and Google Gemini 2.5 Pro. The resulting 27 SAPs were evaluated against a 46-item quality checklist derived from the published SAP guidelines. Items were double-scored by independent trial statisticians on a 0 to 3 scale for accuracy. We compared performance across LLMs and between item types (descriptive vs. statistical reasoning) using mixed-effects logistic regression. Results: Across 9 trials, the models produced SAP drafts with high overall accuracy (77% to 78%). Accuracy did not differ between the three LLMs (p=0.79) but varied by content type (p < 0.001). All models performed well on descriptive items (e.g., administrative details, trial design), with lower accuracy for items requiring statistical reasoning (e.g., modelling strategies, sensitivity analyses). Accuracy for statistical items ranged from 67% to 72%, whereas descriptive items achieved 81% to 83% accuracy. Qualitatively, models were prone to specific failure modes in complex sections, such as omitting necessary details for secondary outcome models or hallucinating sensitivity analyses. Discussion: Current LLMs can effectively draft portions of SAPs, offering the potential for substantial time savings in trial documentation. 
However, a human-in-the-loop approach remains mandatory; while models demonstrate strong capability in producing descriptive content, their independent application to complex statistical methodology design still requires further methodological development and training. Future work should explore advanced prompt engineering, such as retrieval-augmented generation or agentic workflows, to improve reasoning capabilities.

15

When Survival Improves But Quality of Life Does Not: A Model-Based Meta-Analysis of Immune Checkpoint Inhibitors

Sun, Y.; Chang, S.; Tang, K.; LeBlanc, M. R.; Palmer, A. C.; Ahamadi, M.; Zhou, J.

2026-03-05 oncology 10.64898/2026.03.04.26347610 medRxiv

Top 0.1%

7.3%

Background: In immune checkpoint inhibitor (ICI) trials, overall survival (OS) benefits are well established, yet improvements in quality of life (QoL) are often inconsistent or absent in conventional analyses. This apparent discordance raises important questions: are QoL outcomes truly unrelated to survival, and how can QoL results be better utilized and interpreted? Methods: A model-based meta-analysis (MBMA) of longitudinal EORTC QLQ-C30 global health status/quality of life data from randomized ICI trials was conducted. Longitudinal QoL trajectories were analyzed using a nonlinear mixed-effects model to estimate treatment-related toxicity and long-term QoL improvement. Associations between QoL trajectory parameters and OS were assessed using Spearman rank correlation tests and Cox proportional hazards models. Results: Twenty-seven studies (8,149 ICI and 5,593 control patients) contributed longitudinal QoL data, and 18 studies provided matched OS data. Raw QoL trajectories showed overlap between treatment arms, while OS consistently favored ICIs. MBMA revealed that ICIs had similar toxicity but significantly faster QoL improvement than control therapies (p < 0.0001). Baseline QoL, toxicity, and QoL improvement rate were all significantly associated with OS (p < 0.001). MBMA-based QoL comparisons were more sensitive in detecting associations with survival than raw QoL data, with the strongest association observed at Week 24 (R = -0.37, p = 0.067). Conclusions: Conventional analyses comparing QoL at a single time point may obscure meaningful patient-reported benefits. By capturing longitudinal QoL trajectories across trials, MBMA reveals how patient experience evolves alongside survival outcomes and supports improved interpretation and utilization of QoL data in treatment evaluation.

16

Does the type of publisher response to integrity concerns influence subsequent citations? A cohort study.

Studd, H.; Avenell, A.; Grey, A.; Bolland, M.

2026-02-27 health informatics 10.64898/2026.02.25.26346683 medRxiv

Top 0.1%

7.2%

Background: Journals may respond to integrity concerns by publishing an editorial response (editorial notice, expression of concern (EoC) or retraction). We investigated whether the type of editorial response affected citation rates. Methods: We obtained citations for 172 randomised controlled trials (RCTs) with integrity concerns (41 had editorial notices, 38 EoCs and 23 retractions) and control RCTs from the same journal and year. Monthly citation rates up to 60 months before and after editorial responses were compared by editorial response type, and to citation decline in control RCTs. Results: 172 RCTs had 10,603 citations from 6,376 articles. 3,330 control trials were identified for 151/172 RCTs (15,948 citations, 87,811 articles). For both groups, citations increased steadily, peaking 45-65 months post-publication. There were no statistically significant differences in citation decline post-editorial response for trials receiving a retraction, EoC, or notice. Citations were lower in controls than index trials, so analyses were restricted to 1,598 highly cited (>25) controls. The rate of decline for highly cited control trials was not statistically significantly different from the post-editorial response rate for index groups. Conclusion: Citation rate decline after editorial responses did not differ by type of editorial response nor from the natural decline in control trials. Highlights: Journals may respond to integrity concerns by issuing an editorial notice. The effect of expressions of concern or other editorial notices on citation patterns is unclear. Editorial notices did not accelerate citation decline compared with control trials. The type of notice was not associated with differences in citation decline. Late editorial notices appear ineffective in preventing continued citation.

17

Protocol for the development of a tool (INSPECT-IPD) to identify problematic randomised controlled trials when individual participant data are available

Heal, C.; Bero, L.; Antoniou, G. A.; Au, N.; Aviram, A.; Berghella, V.; Bordewijk, E. M.; Bramley, P.; Brown, N. J. L.; Clarke, M.; Fiala, L.; Grohmann, S.; Gurrin, L. C.; Hayden, J. A.; Hunter, K. E.; Hussey, I.; Kahan, B. C.; Lensen, S.; Lundh, A.; O'Connell, N. E.; Parker, L.; Lam, E.; Meyerowitz-Katz, G.; Naudet, F.; Redman, B. K.; Sheldrick, K.; Sydenham, E.; van Wely, M.; Wang, R.; Wjst, M.; Kirkham, J. J.; Wilkinson, J. D.

2026-02-07 health systems and quality improvement 10.64898/2026.02.06.26345217 medRxiv

Top 0.1%

6.7%

Introduction: Randomised controlled trials (RCTs) investigate the safety and efficacy of interventions. It has become clear, however, that some RCTs include fabricated data. The INSPECT-SR tool assesses the trustworthiness of RCTs in systematic reviews of healthcare-related interventions. However, where individual participant data (IPD) can be obtained, a more thorough assessment of trustworthiness is possible. Consequently, INSPECT-SR recommends obtaining IPD to resolve uncertainties, though there is no consensus on appropriate methods for forensic analysis of raw data. Our aim is to evaluate IPD checks to establish which are worthwhile, and how they can be implemented in a new tool, INSPECT-IPD (Investigating Problematic Clinical Trials with Individual Participant Data). Methods and analysis: Using international expert consensus and empirical evidence, the INSPECT-IPD tool will be developed in five stages: (1) compiling a list of IPD trustworthiness checks, (2) evaluating the usefulness and ease of interpretation of the checks when applying them to a collection of presumed authentic and fabricated IPD datasets, (3) a Delphi survey to determine which checks are supported by expert consensus, (4) a series of consensus meetings for selection of checks to be included in the draft tool and finally (5) prospective testing of the draft tool in: a) the production of systematic reviews, and b) the journal editorial process for RCT submissions, leading to refinement based on user feedback. Ethics and dissemination: The University of Manchester ethics decision tool determined that ethical approval was not required (18 June 2024). This project includes secondary research and surveys of healthcare researchers on topics relating to their work. All results will be published as preprints and open-access articles, and the final tool will be freely available. 
Strengths and limitations of this study: An international consensus process and empirical evidence will be used to develop the tool. The development and dissemination of the tool will involve key stakeholders. In the absence of a gold-standard test for problematic data, this tool should not be interpreted as a diagnostic instrument for trustworthiness. Instead, it will assist researchers in assessing the trustworthiness of a study. The tool will only be applicable when individual participant data (IPD) can be accessed. Where IPD can be accessed, the ability to assess trustworthiness will be improved.

18

Performance of Large Language Models in Automated Medical Literature Screening: A Systematic Review and Meta-analysis

Chenggong, X.; Weichang, K.; Liuting, P.; Diaoxin, Q.; Yuxuan, Y.; Bin, W.; Liang, H.

2026-03-19 epidemiology 10.64898/2026.03.17.26348656 medRxiv

Top 0.1%

6.5%

Objective: To systematically evaluate the diagnostic performance of large language models (LLMs) in automated medical literature screening and to determine their potential role in supporting evidence synthesis workflows. Methods: A systematic review and meta-analysis was conducted according to PRISMA DTA guidance. PubMed, Web of Science, Embase, the Cochrane Library and Google Scholar were searched from 1 January 2022 to 17 November 2025. Studies assessing LLMs for automated title and abstract screening or full-text eligibility assessment in medical literature were included. Diagnostic accuracy metrics were extracted and pooled using a bivariate random effects model and hierarchical summary receiver operating characteristic (HSROC) analysis. Subgroup analyses and meta-regression were performed to explore sources of heterogeneity. Results: Eighteen studies published between 2023 and 2025 were included. In title and abstract screening, the pooled sensitivity was 0.92 and pooled specificity was 0.94. The SROC area under the curve (AUC) reached 0.98. In full-text screening, pooled sensitivity and specificity both reached 0.99 and the AUC was 0.99. Prompt strategies incorporating examples or chain-of-thought reasoning significantly improved sensitivity. Across studies, most models were deployed without task-specific fine-tuning and still achieved strong performance. Subgroup analyses and meta-regression did not identify significant sources of heterogeneity. Many studies also reported substantial efficiency gains, including large reductions in screening workload, time and cost. Conclusion: LLMs demonstrate high diagnostic accuracy for automated medical literature screening, particularly in full-text assessment. These models show strong potential as high-sensitivity assistive tools that can substantially reduce manual screening burden while supporting evidence synthesis. Further methodological optimization and validation in large-scale real-world settings are required to establish their long-term role in evidence-based medicine.
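For readers less familiar with the accuracy metrics quoted above, the following minimal sketch shows how sensitivity and specificity are computed for a single screening study. It does not reproduce the bivariate random effects pooling described in the abstract; the class name `ScreeningCounts` and the counts are hypothetical, chosen only so that the result happens to equal the pooled title and abstract figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    """Confusion-matrix counts for one screening run (hypothetical data)."""
    true_pos: int   # relevant records the model included
    false_neg: int  # relevant records the model excluded
    true_neg: int   # irrelevant records the model excluded
    false_pos: int  # irrelevant records the model included


def sensitivity(c: ScreeningCounts) -> float:
    """Share of truly relevant records that were retained."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: ScreeningCounts) -> float:
    """Share of truly irrelevant records that were removed."""
    return c.true_neg / (c.true_neg + c.false_pos)


# Made-up counts for illustration: 100 relevant and 1,000 irrelevant records.
example = ScreeningCounts(true_pos=92, false_neg=8, true_neg=940, false_pos=60)

print(f"sensitivity = {sensitivity(example):.2f}")
print(f"specificity = {specificity(example):.2f}")
```

In screening, sensitivity is usually the more critical of the two, because a missed relevant record cannot be recovered at later review stages, whereas a false positive only costs extra screening time.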

19

Standardisation of terminology, calculation and reporting for assigning exposure duration to drug utilisation records from healthcare data sources: the CreateDoT framework

Riera-Arnau, J.; Paoletti, O.; Gini, R.; Thurin, N. H.; Souverein, P. C.; Abtahi, S.; Duran, C. E.; Pajouheshnia, R.; Roberto, G.

2026-02-19 epidemiology 10.64898/2026.02.18.26346576 medRxiv

Top 0.1%

6.4%

Background: In pharmacoepidemiological studies, days of treatment (DoT) duration associated with individual electronic drug utilization records (DUR) are usually missing. Researcher-defined duration (RDD) calculation approaches, as opposed to data-driven approaches, can be used to estimate DoT based on the specific choices and assumptions made by investigators. These are usually underreported or even undocumented. We aimed to develop a framework for the standardization of terminology, formulas, implementation, and reporting of possible RDD approaches. Methods: A systematic classification of RDD calculation approaches was developed via expert consensus. Universal concepts used to operationalise RDDs were identified and described using standard terminologies. An open-source R function, CreateDoT, was created to implement the formulas using the universal concepts as input parameters. A step-by-step workflow was developed to facilitate implementation and reporting. Results: RDD approaches were classified into two main classes: I) daily dose (DD)-based calculation approaches (n=3 formulas), and II) fixed-duration approaches (n=2). Seven universal concepts were identified to describe the five corresponding generalized formulas for DoT calculation. Input parameters of the CreateDoT function can be retrieved from source data through its mapping to universal concepts, or inputted by the investigator based on the chosen calculation approach. The input file structure itself represents a standard reporting template for documenting investigators' assumptions and methodological choices adopted for DoT calculation. Conclusions: The CreateDoT framework can facilitate the documentation and reporting of RDD approaches for DoT calculation, increasing transparency and reproducibility of pharmacoepidemiological studies regardless of the data model used, and facilitating sensitivity analyses to evaluate the impact of alternative assumptions in DoT calculation.
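The two classes of RDD approach named above can be illustrated with a small sketch. The abstract does not give the five CreateDoT formulas, so what follows is only a generic example of a daily-dose-based and a fixed-duration calculation, not the CreateDoT implementation; all function and parameter names are hypothetical.

```python
def dot_daily_dose_based(units_dispensed: float,
                         strength_per_unit: float,
                         assumed_daily_dose: float) -> float:
    """Days of treatment as total dispensed amount divided by an assumed daily dose.

    strength_per_unit and assumed_daily_dose must share the same unit (e.g. mg).
    """
    if assumed_daily_dose <= 0:
        raise ValueError("assumed_daily_dose must be positive")
    total_amount = units_dispensed * strength_per_unit
    return total_amount / assumed_daily_dose


def dot_fixed_duration(assumed_days: float) -> float:
    """Days of treatment set to an investigator-chosen constant per record."""
    if assumed_days <= 0:
        raise ValueError("assumed_days must be positive")
    return assumed_days


# Hypothetical record: 30 tablets of 20 mg each.
# Assuming 20 mg/day gives 30 days; assuming 40 mg/day gives 15 days.
print(dot_daily_dose_based(30, 20.0, 20.0))
print(dot_daily_dose_based(30, 20.0, 40.0))
print(dot_fixed_duration(28))
```

The example shows why documenting the assumed daily dose matters: the same dispensing record yields a DoT that differs by a factor of two depending on that single investigator choice.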

20

Integrating stakeholder perspectives in modeling routine data for therapeutic decision-making

Pfaffenlehner, M.; Dressing, A.; Knoerzer, D.; Wagner, M.; Heuschmann, P.; Scherag, A.; Binder, H.; Binder, N.

2026-02-18 epidemiology 10.64898/2026.02.18.26346074 medRxiv

Top 0.1%

6.4%

Background: Routinely collected health data are increasingly used to generate real-world evidence for therapeutic decision-making. Yet, stakeholders, including clinicians, pharmaceutical industry representatives, patient advocacy groups, and statisticians, prioritize different aspects of data quality, analysis, and interpretation. Without explicit consideration of these perspectives, analyses risk being fragmented, misaligned with end-user needs, or lacking transparency. Methods: We developed a stakeholder-inclusive conceptual framework for modeling routine health data, informed by an interdisciplinary workshop and supported by targeted literature examples. The framework maps stakeholder priorities to methodological requirements and identifies analytical strategies that enable integration of diverse perspectives. Results: Clinicians prioritize interpretability and clinical relevance; the pharmaceutical industry emphasizes regulatory compliance and real-world evidence generation; patient groups highlight transparency, inclusion of patient-reported outcomes, and privacy protection; and statisticians focus on bias control and methodological rigor. Our framework illustrates how these priorities can be explicitly incorporated into modeling strategies. Multistate models exemplify a methodological approach that operationalizes these requirements by capturing dynamic disease trajectories, integrating intermediate outcomes, and offering graphical interpretability. Beyond specific methodological choices, clinical research relies fundamentally on statistical expertise. Depending on the research goal, statisticians' roles can range from providing statistical consultations for standard analyses, to applying or adapting advanced methods for more complex analyses, to developing new methods for research questions that require novel approaches due to their specific characteristics. 
Conclusions: The stakeholder-inclusive framework provides methodological guidance for designing analyses of routine health data that are clinically meaningful, scientifically rigorous, and socially acceptable. By aligning the research question with the intended perspective from the beginning, it supports more robust and transparent evidence generation, with multistate models serving as a flexible tool to operationalize this integration.